What are they?
Binary variables taking \(1\) if the observation has some attribute, \(0\) otherwise
How do we make them?
In R, with as.factor():
lm(y ~ as.factor(category_var))
What do they do?
summary(lm(INCEARN ~ sex, acs_data))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 142586.6 2360.005 60.41793 0.000000e+00
## sexMale 61122.7 2879.125 21.22961 7.209473e-98
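The acs_data set itself is not reproduced here, so the sketch below uses simulated income data (all numbers invented) to show what the dummy coefficients recover: the intercept is the mean of the reference group, and the coefficient on the dummy is the difference in group means.

```r
# Simulated sketch (acs_data not available; values are illustrative):
set.seed(42)
male <- rbinom(1000, 1, 0.5)
income <- 140000 + 60000 * male + rnorm(1000, sd = 50000)
fit <- lm(income ~ as.factor(male))

# The intercept equals the mean income of the reference group (male == 0):
unname(coef(fit)[1]) - mean(income[male == 0])   # ~ 0
# The dummy coefficient equals the difference in group means:
unname(coef(fit)[2]) - (mean(income[male == 1]) - mean(income[male == 0]))  # ~ 0
```

With a single 0/1 dummy, least squares fits the two group means exactly, so both differences are zero up to floating-point error.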
We can include all categories and drop the intercept
summary(lm(INCEARN ~ -1 + sex, acs_data))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## sexFemale 142586.6 2360.005 60.41793 0
## sexMale 203709.3 1649.163 123.52281 0
How does our interpretation change?
## Estimate Std. Error t value Pr(>|t|)
## sexFemale 142586.6 2360.005 60.41793 0
## sexMale 203709.3 1649.163 123.52281 0
We can consider a case where there are several categories:
Brexit Vote: What are regional differences in vote for Brexit?
Five regions: Southern England, Midlands, Northern England, Wales, and Scotland
How do you interpret this?
summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.61020359 7.7829764 5.7317665 2.048832e-08
## Pct_Trn 0.08933588 0.1027488 0.8694588 3.851539e-01
## Region5The Midlands 8.57606186 1.2532253 6.8431923 3.179826e-11
## Region5Northern England 6.35984691 1.3248565 4.8004045 2.293393e-06
## Region5Wales 2.30867033 2.0441434 1.1294072 2.594499e-01
## Region5Scotland -11.60364022 1.8478179 -6.2796448 9.422928e-10
How do you interpret this?
summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.18626546 7.7855372 6.8314188 3.420699e-11
## Pct_Trn 0.08933588 0.1027488 0.8694588 3.851539e-01
## Region5Northern England -2.21621496 1.5559745 -1.4243260 1.551859e-01
## Region5Wales -6.26739154 2.2030506 -2.8448696 4.687357e-03
## Region5Scotland -20.17970209 2.0149071 -10.0152022 4.494699e-21
## Region5Southern England -8.57606186 1.2532253 -6.8431923 3.179826e-11
How do you interpret this?
summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 50.97005050 7.3635671 6.9219239 1.946734e-11
## Pct_Trn 0.08933588 0.1027488 0.8694588 3.851539e-01
## Region5Wales -4.05117658 2.1752891 -1.8623624 6.333598e-02
## Region5Scotland -17.96348713 1.9097366 -9.4062643 5.303722e-19
## Region5Southern England -6.35984691 1.3248565 -4.8004045 2.293393e-06
## Region5The Midlands 2.21621496 1.5559745 1.4243260 1.551859e-01
How do you interpret this?
summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.91887392 7.6346802 6.1454930 2.043370e-09
## Pct_Trn 0.08933588 0.1027488 0.8694588 3.851539e-01
## Region5Scotland -13.91231055 2.4939139 -5.5785047 4.660849e-08
## Region5Southern England -2.30867033 2.0441434 -1.1294072 2.594499e-01
## Region5The Midlands 6.26739154 2.2030506 2.8448696 4.687357e-03
## Region5Northern England 4.05117658 2.1752891 1.8623624 6.333598e-02
How do you interpret this?
summary(lm(Pct_Lev ~ Pct_Trn + Region5, brexit))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.00656337 7.2248567 4.5684731 6.681290e-06
## Pct_Trn 0.08933588 0.1027488 0.8694588 3.851539e-01
## Region5Southern England 11.60364022 1.8478179 6.2796448 9.422928e-10
## Region5The Midlands 20.17970209 2.0149071 10.0152022 4.494699e-21
## Region5Northern England 17.96348713 1.9097366 9.4062643 5.303722e-19
## Region5Wales 13.91231055 2.4939139 5.5785047 4.660849e-08
How do you interpret this?
summary(lm(Pct_Lev ~ -1 + Pct_Trn + Region5, brexit))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## Pct_Trn 0.08933588 0.1027488 0.8694588 3.851539e-01
## Region5Scotland 33.00656337 7.2248567 4.5684731 6.681290e-06
## Region5Southern England 44.61020359 7.7829764 5.7317665 2.048832e-08
## Region5The Midlands 53.18626546 7.7855372 6.8314188 3.420699e-11
## Region5Northern England 50.97005050 7.3635671 6.9219239 1.946734e-11
## Region5Wales 46.91887392 7.6346802 6.1454930 2.043370e-09
In R: you can shift the reference category using the following:
levels = c('Scotland', 'Southern England', 'The Midlands', 'Northern England', 'Wales')
brexit$Region5 = factor(brexit$Region5, levels = levels)
The first category is dropped (in this case, Scotland)
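Alternatively, base R's relevel() shifts the reference (first) level of a factor without respecifying the whole ordering; a toy illustration:

```r
# relevel() moves one category to the front (the reference level):
region <- factor(c("Wales", "Scotland", "Wales", "The Midlands"))
levels(region)[1]                  # "Scotland" (alphabetical default)
region <- relevel(region, ref = "Wales")
levels(region)[1]                  # "Wales" is now the dropped category
```

For the Brexit data, brexit$Region5 <- relevel(factor(brexit$Region5), ref = "Scotland") would make Scotland the reference without listing all five regions.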
Hooke’s law is a law of physics that states that the force (\(F\)) needed to extend or compress a spring by some distance \(x\) scales linearly with respect to that distance.
So the length of a spring is linearly proportional to the force applied. For a given weight \(X_i\) on a spring, the length \(Y_i\) is:
\[Y_i = \alpha + \beta X_i + \epsilon_i\] for \(i\) in \(1 \ldots n\)
This is a statistical model:
We use observed data to estimate the parameters \(\alpha, \beta, \sigma^2\)
The statistical model for the regression makes certain assumptions about the process generating the data.
To find a line, we create a statistical model for a linear equation that describes the process generating the data.
\[Y_i = \alpha + \beta X_i + \epsilon_i\]
deterministic (not random) \(\alpha + \beta X_i\)
stochastic (random error) \(\epsilon_i\)
\(\alpha\): parameter for the intercept (fixed for all cases)
\(\beta\): parameter for the slope (fixed for all cases)
\(X_i\): value of independent variable (input into the equation)
\(Y_i\): value of dependent variable (determined by the equation)
\(\epsilon_i\): random error for observation \(i\)
x = rnorm(100)
y = 3 + x * 0.5
plot(x, y, main = bquote(Y[i] == alpha + beta * X[i]))
abline(a = 3, b = 0.5)
y_star = 3 + x * 0.5 + rnorm(100)
plot(x, y_star, main = bquote(Y[i] == alpha + beta * X[i] + epsilon[i]))
abline(a = 3, b = 0.5)
This regression model:
\[Y_i = \alpha + \beta X_i + \epsilon_i\]
is linked to the least-squares calculations we did the last two weeks. The regression model can be estimated with least squares:
\[Y_i = \hat{\alpha} + \hat{\beta} X_i + e_i\]
where \(Var(e) = \widehat{\sigma^2}\)
Differences:
Another key assumption: (if \(X\) is also a random variable) \(X \mathrel{\unicode{x2AEB}} \epsilon\)
If a heavier weight (large \(X_i\)) made a large \(\epsilon_i\) more likely, and a lighter weight (small \(X_i\)) made a small \(\epsilon_i\) more likely, then we would over-estimate \(\beta\).
Model equation is correctly specified (correct functional forms, all relevant variables are included). \(Y_i = \alpha + \beta X_i + \epsilon_i\) is the true data-generating function
\(\epsilon_i\) are independently, identically distributed (i.i.d.), \(E(\epsilon_i) = 0\) and variance of \(\sigma^2\). homoskedastic vs. heteroskedastic
\(\epsilon_i\) is independent of \(X_i\): \(X \mathrel{\unicode{x2AEB}} \epsilon\)
We cannot empirically verify these assumptions.
When using least squares to estimate this model, we mathematically impose \(e_i\) to be orthogonal to \(x_i\) and \(E(e_i) = 0\). But this is by assumption, not a fact.
If these assumptions are a good model for the data we observe, then with large \(n\), we can estimate our parameters \(\alpha, \beta, \sigma^2\) using least squares.
In the multivariate situation, the statistical model for regression is:
\[Y = \mathbf{X\beta} + \epsilon\]
Each row of this becomes:
\[Y_i = \mathbf{X_i\beta} + \epsilon_i\]
Statistical Assumptions (statistical model):
Mathematical Assumptions (least squares):
Statistical Assumptions (statistical model):
We want to estimate \(p + 1\) parameters… (what are they?)
\(\beta\) estimated by \(\widehat{\beta}\) using least squares:
\[\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\]
Note: \(\widehat{\beta}\) is a random vector; each element is a random variable. Why? Recall \(Y = \mathbf{X\beta} + \epsilon\)
By assumption: \(Y = \mathbf{X\beta} + \epsilon\), so:
\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\)
\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X\beta} + \epsilon)\)
\(\widehat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)
\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\)
So:
\(E(\widehat{\beta}|X) = E(\mathbf{\beta}|\mathbf{X}) + E((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon|\mathbf{X})\)
\(E(\widehat{\beta}|X) = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon|\mathbf{X})\)
And because, by assumption, \(X \mathrel{\unicode{x2AEB}} \epsilon\)
\(E(\widehat{\beta}|X) = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon)\)
And because, by assumption, \(E(\epsilon) = 0\)
\(E(\widehat{\beta}|X) = \mathbf{\beta}\)
Thus, if our model assumptions are correct, \(\widehat{\beta}\) is an unbiased estimate of parameter \(\beta\)
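The unbiasedness result can be illustrated by simulation: under the model assumptions, the average of \(\widehat{\beta}\) across many repeated samples is close to the true \(\beta\). A minimal sketch (true values 3 and 0.5 chosen for illustration):

```r
# Monte Carlo sketch of unbiasedness: draw many samples from the
# assumed data-generating process and average the slope estimates.
set.seed(1)
beta_hats <- replicate(2000, {
  x <- rnorm(100)
  y <- 3 + 0.5 * x + rnorm(100)     # true alpha = 3, beta = 0.5
  unname(coef(lm(y ~ x))["x"])
})
mean(beta_hats)                     # close to the true beta, 0.5
```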
We need to find a way of finding the uncertainty in our estimate \(\widehat{\beta}\)
\[\scriptsize{\begin{pmatrix} Var(\beta_1) & Cov(\beta_2,\beta_1) & Cov(\beta_3,\beta_1) &\ldots & Cov(\beta_p,\beta_1) \\ Cov(\beta_1,\beta_2) & Var(\beta_2) & Cov(\beta_3,\beta_2) & \ldots & Cov(\beta_p,\beta_2) \\ Cov(\beta_1,\beta_3) & Cov(\beta_2,\beta_3) & Var(\beta_3) & \ldots & Cov(\beta_p,\beta_3) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ Cov(\beta_1,\beta_p) & Cov(\beta_2,\beta_p) & Cov(\beta_3, \beta_p) & \ldots & Var(\beta_p)\end{pmatrix}}\]
How do we use the variance-covariance matrix?
The square root of the diagonal elements (variances) gives the standard error for each parameter estimate in \(\widehat{\beta}\) (hypothesis testing)
The off-diagonal elements help answer questions like: is \(\beta_2 + \beta_3 \neq 0\)? We need \(Cov(\beta_2, \beta_3)\) to get \(Var(\beta_2 + \beta_3)\).
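In R, the estimated variance-covariance matrix comes from vcov(), and \(Var(\hat{\beta}_2 + \hat{\beta}_3) = Var(\hat{\beta}_2) + Var(\hat{\beta}_3) + 2\,Cov(\hat{\beta}_2, \hat{\beta}_3)\) can be read straight off it. A sketch with invented toy data:

```r
# Toy data (hypothetical): two correlated predictors.
set.seed(2)
x2 <- rnorm(200)
x3 <- x2 + rnorm(200)              # correlated with x2
y  <- 1 + x2 + x3 + rnorm(200)
V  <- vcov(lm(y ~ x2 + x3))        # estimated variance-covariance matrix

# Var(b2 + b3) = Var(b2) + Var(b3) + 2 Cov(b2, b3):
var_sum <- V["x2", "x2"] + V["x3", "x3"] + 2 * V["x2", "x3"]
sqrt(var_sum)                      # SE of the sum beta2_hat + beta3_hat
```

Equivalently, var_sum is the quadratic form \(a' V a\) with \(a = (0, 1, 1)'\).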
\(\widehat{\beta} = \mathbf{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon\), So:
\[cov(\widehat{\beta}|X) = E((\widehat{\beta} - \beta)(\widehat{\beta} - \beta)' | X)\]
\[ = E( ((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)((\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\epsilon)' | X)\]
\[ = E( ((\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon)(\epsilon'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}) | X)\]
\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\epsilon\epsilon'|X)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]
What really matters here is: \(E(\epsilon\epsilon'|X)\)
\[E(\epsilon\epsilon'|X) = \sigma^2\mathbf{I}_{n\times n}\]
\(\epsilon\epsilon'\) is \(n \times n\) matrix with \(ij\)th elements \(\epsilon_i\epsilon_j\)
Taking the expectation:
\[ = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\sigma^2\mathbf{I}_{n\times n}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\]
\[ = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\] \[cov(\widehat{\beta}|X) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\]
Can we start computing standard errors for \(\widehat{\beta}\) now? What is missing?
\(\hat{\sigma^2} = \frac{\sum\limits_i^n e_i^2}{n - p}\)
and \(E[\hat{\sigma^2} | X] = \sigma^2\) (unbiased)
so, we estimate the variance-covariance matrix:
\(\widehat{cov}(\widehat{\beta}) = \widehat{\sigma^2}(\mathbf{X}'\mathbf{X})^{-1}\)
THe variance (first diagonal element of \(\widehat{cov}(\widehat{\beta})\)) is
\(Var(\hat{\beta_p}) = \frac{\hat{\sigma^2}}{\sum\limits_i^n (X^*_{i} - \overline{X^*})^2}\)
So the standard error is:
\(SE(\hat{\beta_p}) = \sqrt{Var(\hat{\beta_p})} = \frac{\widehat{\sigma}}{\sqrt{n}\sqrt{Var(X_p^*)}}\)
99 in control, 1 in treatment
var(c(rep(0, 99), rep(1, 1)))
## [1] 0.01
50 in control, 50 in treatment
var(c(rep(0, 50), rep(1, 50)))
## [1] 0.2525253
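The balanced design maximizes \(Var(X)\) and so minimizes \(SE(\hat{\beta})\). A simulation sketch (the outcome model here is invented for illustration):

```r
# SE(beta-hat) shrinks with sqrt(Var(X)): compare the treatment
# coefficient's standard error under unbalanced vs. balanced designs.
set.seed(3)
se_for <- function(x) {
  y <- 2 + 0.5 * x + rnorm(length(x))
  summary(lm(y ~ x))$coefficients["x", "Std. Error"]
}
se_for(c(rep(0, 99), rep(1, 1)))   # unbalanced: large SE
se_for(c(rep(0, 50), rep(1, 50)))  # balanced: roughly 5x smaller
```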
In sum:
For these to be correct, we must assume:
If any of these assumptions is wrong, \(\widehat{\beta}\) may be biased and the standard errors incorrect.
Hooke’s Law
seems reasonable, since the law is linear in the weights placed on the spring
are errors in measurement of length, weights, or other procedures that affect length of the spring independent and share the same variance?
are the errors unrelated to the weight placed on the spring?
When do things go wrong?
\(H_0: \beta = \beta^*\) vs. \(H_a: \beta \neq \beta^*\)
Typically, we take \(\beta^* = 0\), but in principle it can be any value.
Because we estimate the standard errors, use a \(t\) statistic:
\(t_{n-p} = \frac{\hat{\beta} - \beta^*}{se(\hat{\beta})}\)
Then obtain a \(p\) value (one-sided or two-sided, depending on \(H_a\))
If we are using a \(t\)-test, don’t we need to make assumptions about asymptotic normality? Yes.
Two approaches
If we assume \(\epsilon\) is normally distributed, \(N(0, \sigma^2)\), then the sampling distribution of \(\hat{\beta}\) is exactly normal in any sample size.
If \(\epsilon\) is not normally distributed: under some believable assumptions about the data, and assuming observations are independent, the sampling distribution of \(\hat{\beta}\) converges to a normal distribution as \(n \to \infty\) by the central limit theorem. In practice, this works if \(n\) is moderately large and the data are not highly non-normal. Failing that, we can always bootstrap.
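A sketch of a nonparametric (pairs) bootstrap for the slope: resample rows with replacement, re-fit each time, and use the spread of the re-estimates as the sampling distribution. The data below are simulated with deliberately non-normal errors for illustration.

```r
# Pairs bootstrap for the slope coefficient.
set.seed(4)
n <- 100
x <- rnorm(n)
y <- 3 + 0.5 * x + (rexp(n) - 1)    # skewed, mean-zero errors

boot_betas <- replicate(1000, {
  idx <- sample(n, replace = TRUE)  # resample (x, y) pairs
  unname(coef(lm(y[idx] ~ x[idx]))[2])
})
sd(boot_betas)                      # bootstrap SE for the slope
quantile(boot_betas, c(0.025, 0.975))  # percentile 95% CI
```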
Regress Brexit Vote on Rainfall
#Make design-matrix
X = cbind(1, brexit$total_precip_polling)
#Make outcome matrix
Y = brexit$Pct_Lev
#Solve for Beta
beta_mat = solve(t(X) %*% X) %*% t(X) %*% Y
#Create y_hat
y_hat = X %*% beta_mat
#Create residuals
e = Y - y_hat
#Estimate sigma squared
sigma2_hat = sum(e^2) / (nrow(X) - ncol(X))
#Estimate variance-covariance
var_covar = sigma2_hat * solve(t(X) %*% X)
var_covar
## [,1] [,2]
## [1,] 0.39786504 -0.012415453
## [2,] -0.01241545 0.001306648
#beta standard errors:
beta_ses = var_covar %>% diag %>% sqrt
#get coefficient, standard errors, and t-stats
cbind(beta_mat, beta_ses, beta_mat / beta_ses)
## beta_ses
## [1,] 54.1460994 0.63076544 85.841894
## [2,] -0.1058053 0.03614759 -2.927036
#check against `lm`
lm(Pct_Lev ~ total_precip_polling, brexit) %>% summary %>% .$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.1460994 0.63076544 85.841894 5.311451e-250
## total_precip_polling -0.1058053 0.03614759 -2.927036 3.628998e-03
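The only piece of lm's table not computed by hand above is the p-value: a two-sided tail probability from a \(t\) distribution with \(n - p\) degrees of freedom. A sketch with toy data standing in for the brexit data:

```r
# Completing the by-hand table with p-values (toy data, illustrative).
set.seed(5)
X <- cbind(1, rnorm(200))
Y <- 54 - 0.1 * X[, 2] + rnorm(200)

beta_mat   <- solve(t(X) %*% X) %*% t(X) %*% Y
e          <- Y - X %*% beta_mat
sigma2_hat <- sum(e^2) / (nrow(X) - ncol(X))
beta_ses   <- sqrt(diag(sigma2_hat * solve(t(X) %*% X)))
t_stats    <- drop(beta_mat) / beta_ses
p_vals     <- 2 * pt(-abs(t_stats), df = nrow(X) - ncol(X))  # two-sided

round(cbind(est = drop(beta_mat), se = beta_ses, t = t_stats, p = p_vals), 4)
```

These p-values match summary(lm(...))$coefficients exactly, since lm uses the same formulas.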
Let’s interpret the hypothesis test:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.1460994 0.63076544 85.841894 5.311451e-250
## total_precip_polling -0.1058053 0.03614759 -2.927036 3.628998e-03
For the intercept: is it “significant”? Is this informative in any meaningful way?
For the slope: is it “significant”? Is this informative in any meaningful way?
Does the regression model apply in this context?
Is the model: \(Brexit \ Leave \ \%_i= \alpha + \beta \ mmRainfall_i + \epsilon_i\) correct?
Is \(\epsilon\) i.i.d. with mean \(0\) and variance \(\sigma^2\)?
Is \(\epsilon_i\) independent of \(mmRainfall_i\)?
Given these assumptions (sometimes called Gauss-Markov assumptions)
we can estimate \(\beta\) using \(\widehat{\beta}\) from least squares, and this estimate will be
But are these assumptions reasonable (compared to Neyman causal model)?